(54) Method and device for rating of speech quality



(57) The present invention relates to a method and
device for determining the quality of speech. The speech
to be evaluated is listened to by a person who reproduces
the speech. Stops of vowel sounds in the produced and
reproduced speech respectively are determined. The
difference between the stops of the vowel sounds is
registered. From the obtained differences an average
value is formed. The average value indicates the quality
of the produced speech. The invention can be used for
evaluating different speech-producing sources, such as
equipment and/or machines, and people's ability to
comprehend the speech.
Description

TECHNICAL FIELD 

The present invention relates to the rating of speech
quality in a given speech. The speech source which is
analysed can be synthesized speech or speech from
different persons.

STATE OF TECHNOLOGY

Most methods for assessing the quality of synthetic
speech in text-to-speech conversion concentrate on the
segmental realisation, using perception tests with
nonsense words such as appa, ippi, agga etc. This
approach says little or nothing about how good the
synthetically produced speech is and how useful it is in
applications. To solve this problem, studies have begun
of cognitive stress during the use of synthetic speech,
for example by making the subject of the experiment
perform different tasks while he/she is exposed to
information delivered by synthetic speech, the content
of which he/she has to give an account of.

In synthetic speech the non-primary parameters are
to a large extent lacking, with the result that the
interacting parameters in many cases give directly
contradictory information and comprehension is lower
than with natural speech. Especially in noisy
environments the listener needs these non-primary
signal parameters, so the comprehension of synthetic
speech is drastically diminished in such surroundings.

Patent document US 4672668 describes how a system
pronounces a stored standard word with defined length,
stress and rhythm. A person repeats the standard words
and tries to imitate the length, stress and rhythm. The
repeated words are detected and processed to determine
whether certain criteria concerning identity with the
standard words pronounced by the system are met. If the
repeated word meets the criteria of identity it is
stored as a reference word.

Patent document US 5282475 describes a technology
relating to audiometry. A sequence of speech stimuli is
presented to a person while at least one physiological
response of the human subject, which varies according
to the subject's reception (understanding), is monitored.

Patent document US 5303327 describes a method
according to which a verbal stimulus is presented to a
person, after which the answer to the verbal stimulus is
registered. The answers deal with statements and/or
receptivity.

DESCRIPTION OF THE INVENTION

TECHNICAL PROBLEM

There is a need for evaluating total quality, including
prosody, in for instance text-to-speech conversion.

The methods used today for evaluating total quality 
are based on trials with a large number of persons. 
These persons deliver an opinion on the quality of the 
speech in question. There is a need to find methods 
which are automatic and do not need to use a number 
of persons participating in the evaluation. 

In situations where a choice is to be made between
different speakers it can be important to find the
speaker who is easiest to comprehend. Methods for
quickly evaluating such speakers and choosing the one
who is probably easiest to comprehend are therefore
desirable. A further problem is that certain groups of
people have more difficulty in perceiving speech than
others. In this situation too it is desirable to find
methods by which the quality of a speech can be graded
in relation to the capacity of the group of listeners.

Methods which are usable for synthetic speech and
pathological speech are lacking at present.
Possibilities for studying social handicap are also
wanted.

SOLUTION 

The present invention relates to a method of
determining speech quality. A speech which is produced
is listened to by a person who repeats the speech. The
vowels of the produced and reproduced speech
respectively are identified. Further, the points of time
for the start of each vowel sound are identified. A time
difference between the corresponding starts of vowel
sounds is established. The obtained time difference
indicates the quality of the produced speech.

The reproduction of the speech is performed by a
person listening to the speech and verbally reproducing
it as soon as possible.

The speech is produced in a text-to-speech converter
or consists of a message recorded in advance which is
reproduced by, for instance, a tape recorder.

A reference for the quality of the produced speech
is obtained by calibration of the system. This is
performed by reading a speech of quality known in
advance. The person who repeats the calibration message
will repeat it with some delay in relation to the
original message. In this way a reference is obtained
against which different persons' repetitions of the
message are comparable. The calibration procedure makes
it possible to take into account, for instance, a
person's form on the day. The method further makes it
possible to determine the speech quality of a
text-to-speech converter, of different persons, or of
human speech recorded on, for instance, a tape recorder.

The invention further relates to a device for
determining speech quality. A device, 5, is arranged to
produce a speech. The produced speech is analysed and
reproduced by a function, 1. A device, 7, determines the
starts of the vowel sounds in the produced and
reproduced speech respectively. In the device, 7, a time
difference between the corresponding starts of vowel
sounds in the produced and reproduced speech is
registered. The time difference indicates a measure of
the quality of the speech and is presentable via the
device, 7.

The device, 5, in figure 1 consists of a
text-to-speech converter for production of a speech.
Further, the function, 1, consists of a person. He/she
listens to the produced speech and repeats it. The
person, 1, shall reproduce the produced speech as soon
as possible after he/she has listened to it. In the
device, 7, a time difference analysis equipment is
arranged to determine the time difference between the
starts of vowels in the produced and reproduced speech.
The device, 1, is further arranged to give a certificate
of quality of the produced speech. The time difference
equipment, 7, is further arranged to form an average
value of the obtained time differences. The average
value indicates the quality of the produced speech. The
device, 1, is further arranged to comprise a first
speech recognition equipment, 2, for determining the
start of vowel sounds in the produced speech. Further it
comprises a second speech recognition equipment, 3, for
determining the start of vowel sounds in the reproduced
speech.

For calibration of the equipment a calibration
source, 6, is used according to figures 3 and 4, which
is arranged to be connected in place of the device, 5.

The calibration source is arranged to produce a
speech the quality of which is known in advance. In this
way a reference is obtained in relation to the person, 1,
who has been used for the reproduction of the speech.
A reliable evaluation of the produced speech is thus
obtained independent of the person, 1.
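
One simple way to use the reference can be sketched as follows, under the assumption that the listener's own baseline can be removed by subtraction (the specification does not prescribe how the reference is applied). The delay on the evaluated speech is reported relative to the delay the same listener showed on the calibration speech. The function name and the sample values are illustrative only.

```python
def delay_relative_to_reference(test_mean_delay, reference_mean_delay):
    """Return the mean delay on the evaluated speech minus the mean delay
    the same listener showed on the calibration speech (both in seconds).

    Subtraction is an assumption of this sketch, not a step taken from
    the specification.
    """
    return test_mean_delay - reference_mean_delay


# Hypothetical mean delays in seconds. Listener B is slower than listener A
# on both the calibration speech and the evaluated speech.
listener_a = delay_relative_to_reference(0.75, 0.25)
listener_b = delay_relative_to_reference(1.0, 0.5)
```

With these placeholder values both listeners yield the same relative delay, which is the sense in which the result no longer depends on who repeated the speech.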

ADVANTAGES 

The present invention has the advantage of
measuring speech quality including prosody. Previously
known methods of measuring have determined only
segmental quality.

At the production of synthetic speech from a text 
different text-to-speech converters can be compared. 

The invention can be used for evaluating social 
handicap in connection with pathological speech. 

By having a speech with a given quality as a
reference a graded system for different speeches can be
obtained. This is achieved by using a number of
reference speeches with, for instance, the grades very
good, good and poor. In the analysis the given speech
can then be assigned to one of the mentioned categories.

DESCRIPTION OF FIGURES

Figure 1 shows the essential composition of the 
system. 

Figure 2 shows how the equipment, 5, is divided into
a text analysis equipment, 50, and a speech synthesizing
equipment, 51.

Figure 3 shows how a reference equipment, 6, has
been connected to the system and is reproduced by a
person before the equipment, 5, is connected for an
analysis of the given speech.

Figure 4 shows the equivalent of figure 3 where the 
given speech is produced by a person and the repro- 
duction is performed by a person. 

Figure 5 shows the invention in the form of a flow
chart.

DETAILED EMBODIMENT 

In the following the invention is described with
reference to the figures and the designations therein.

According to figure 1 speech is produced in a device
5. The speech is transferred in parallel to devices 1
and 7. In device 1 the speech is listened to and
reproduced. The produced and reproduced speech are
transferred to a device 7. Analysis of the speeches then
takes place and the vowel sounds in each speech are
identified. For each vowel sound the start of the vowel
sound is determined. In device 7 the points of time for
the starts of vowel sounds in each speech are obtained.
The points of time for the starts of the vowel sounds
are analysed.

The time difference between the starts of vowel
sounds in the speeches is determined. If the starts of
the vowel sounds in the produced speech are denoted
V1, V2, V3 etc., and the starts of the vowel sounds in
the reproduced speech are denoted V1', V2', V3' etc.,
the differences can be denoted X1, X2 etc., where
X1 = V1' - V1, X2 = V2' - V2 etc. The average value of
these N differences is given by

$$E(X) = \frac{1}{N} \sum_{i=1}^{N} X_i$$
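
As an illustration only, the average of the differences between vowel starts can be computed as in the following Python sketch. The function name `mean_onset_delay` and the sample start times are assumptions made for this sketch and do not appear in the specification. The sketch also assumes that the vowel starts of the two speeches have already been paired one-to-one.

```python
from statistics import fmean


def mean_onset_delay(produced, reproduced):
    """Return the mean of the differences (reproduced start - produced start).

    `produced` holds the start times of the vowel sounds in the produced
    speech and `reproduced` the start times in the reproduced speech,
    both in seconds and in the same order.
    """
    if len(produced) != len(reproduced) or not produced:
        raise ValueError("need the same, non-zero number of vowel starts")
    differences = [v_rep - v_prod for v_prod, v_rep in zip(produced, reproduced)]
    return fmean(differences)


# Hypothetical vowel start times in seconds.
produced_starts = [0.5, 1.0, 1.5, 2.0]
reproduced_starts = [0.75, 1.25, 2.0, 2.25]
average_delay = mean_onset_delay(produced_starts, reproduced_starts)
```

A difference becomes negative whenever the listener starts a vowel before the source does, so such cases lower the average rather than being discarded.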

The grading of the produced speech follows from the
fact that the bigger the time delay of the reproduced
speech in relation to the produced speech, the worse the
speech is understood. The grading of the quality of the
speech can, for instance, be referred to different time
intervals within which the speech is reproduced.
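
Grading by time intervals can be sketched as below. The interval limits (0.3 s and 0.6 s) are placeholders chosen for this sketch, since the specification does not fix any values, and the function name is likewise an assumption.

```python
# Hypothetical upper limits (seconds) of the mean delay for each grade,
# ordered from best to worst. The values are placeholders, not measured data.
GRADE_LIMITS = [(0.3, "very good"), (0.6, "good")]
WORST_GRADE = "poor"


def grade_from_delay(mean_delay):
    """Map a mean delay to a grade: the larger the delay, the worse the grade."""
    for upper_limit, grade in GRADE_LIMITS:
        if mean_delay <= upper_limit:
            return grade
    return WORST_GRADE
```

A mean delay at or below zero, which arises when the listener anticipates the speech, falls into the best interval in this sketch.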

Figure 3 further shows how a speech is produced in a
text-to-speech converter 5. The speech is transferred to
the analysis equipment 2 and to a person, 1, whose task
is to verbally reproduce the speech as soon as possible
into a microphone which is connected to the equipment 3.
In the equipment 2 the starts of the vowel sounds in the
produced speech are determined. In the equipment 3 the
starts of the vowel sounds in the verbally reproduced
speech are determined. In the equipment 4 the difference
between the starts of the vowel sounds of the produced
speech and the reproduced speech is formed. A
peculiarity which can occur when a person reproduces the
speech is that the person can predict the coming speech
from the given speech and its delivery. This means that
the person can in certain cases reproduce the speech at
the same time as, or even ahead of, the speech
production device. In this case too a difference between
the starts of the vowel sounds is formed in the
equipment 4.

When the average value is formed it is in this case
possible to obtain an average close to 0, which
indicates that the speech is very easy to understand.

By making different categories of people listen to
the same speech, different kinds of, for instance,
impaired hearing can be compared. Text-to-speech
converters can in these cases be adapted in an adequate
way to the needs of different categories of persons. For
instance, persons with different kinds of impaired
hearing can be analysed and suitable equipment produced
for them.

To obtain an adequate grading some form of
reference system is required. Figure 3 shows such a
system, where a reference equipment 6 is connected to
the system. The text which in this case is read by the
equipment has, for instance, been categorized in advance
by subjective measurements. Such subjective measurements
are performed, for instance, in sound laboratories.
Changing between the reference equipment and the trial
equipment is made via the switch. The stored message in
equipment 5 can, for instance, consist of messages of
different quality. At the reading the analysis equipment
receives information about the quality of the present
speech. This is noted at the reference analysis and the
result is stored in a memory which is arranged in the
analysis equipment. A system with an arbitrary division
of the grading is thus achieved. The 6 stored messages
in the equipment preferably consist of messages recorded
on tape or another durable medium. What is important is
that the reference messages are the same for the
different reference alternatives, so that the results
are comparable. The time differences between the starts
of the vowels of the produced and the reproduced speech
are determined and an average is formed as described
above. The average values thus obtained indicate the
thresholds for different grades in the analysis of a
speech.
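
A minimal sketch of such a reference memory follows. It assumes that each reference message carries a grade assigned in advance and has already been reduced to its list of time differences. The name `build_reference_register` and the sample values are illustrative and not taken from the specification.

```python
from statistics import fmean


def build_reference_register(reference_runs):
    """Store the average time difference obtained for each reference grade.

    `reference_runs` maps a grade known in advance to the time differences
    (seconds) measured while the listener repeated that reference message.
    """
    return {grade: fmean(diffs) for grade, diffs in reference_runs.items()}


# Hypothetical calibration data for one listener.
register = build_reference_register({
    "very good": [0.25, 0.25, 0.25, 0.25],
    "good": [0.5, 0.5, 0.75, 0.25],
    "poor": [1.0, 0.75, 1.25, 1.0],
})
```

Because the same reference messages are used every time, registers built for different listeners or sessions can be compared grade by grade.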

Figure 4 shows how the reference equipment 6 is
connected, together with a person, 1, who reproduces the
speech. After a reference evaluation has been made, a
person reading a text is in this case connected by
operating the switch.

The verbal production of the person, 5, is listened
to and reproduced by a person, 1, and the speeches are
analysed as described above. The starts of the vowel
sounds in each speech are compared and an average is
formed as previously described, so that the verbal
production of the person, 5, is compared with the
ability of the person, 1, to reproduce that speech. By
comparing the obtained average value with the average
value for the reference equipment, an evaluation of the
verbal production ability of the speaker, 5, is obtained
in equipment 4.



Thus it is possible, starting from a reference
applied to the reference equipment, to find out whether
a speaker's, 5, account can be reproduced by and is
understandable to another person in relation to a
reference. The person, 1, who repeats the speech can for
instance be a person or a group of persons with
different kinds of impaired hearing. The equipment in
this case provides a tool for selecting which person or
persons should speak to a certain kind of audience. This
can, for instance, be of crucial importance at lectures,
lessons etc. where persons with a certain hearing
handicap or other types of handicap are listeners. It is
in this case possible to select lecturers/teachers to
suit the audience. This can be of crucial importance for
making a message reach the listeners.

Figure 2 further shows how a text-to-speech
converter 5 according to the previous descriptions can
be realised. In this case the text is analysed in the
equipment 50. The text is transferred to a speech
synthesizing equipment 51. The speech synthesizing
equipment then produces a speech which corresponds to
the given text. Both the text analysis equipment and the
speech synthesizing equipment are already available on
the market. A closer description of these is not
necessary since those skilled in the art are well
acquainted with such equipment.

Referring to the flow chart in figure 5, the
functionality of the invention can be described as first
deciding whether calibration of the system shall be made
or not. Depending on this decision, either a speech with
known quality or the speech to be analysed is produced.
The produced speech is listened to and reproduced. The
starts of vowel sounds in the produced and reproduced
speech respectively are determined. The time difference
between the starts of the vowel sounds in the respective
speeches is determined. After that the average value of
these differences is formed.

If the average value was formed for a calibration of
the system, the obtained result is placed in a reference
register 18. It is then decided whether more references
are to be placed in the system. If that is the case, the
next speech reference is taken and the procedure
described above is repeated. If all references have been
gone through there is in this case too a restart.

If, on the other hand, the obtained average value
was intended for an evaluation of a speech produced by
an equipment or a person, a comparison with the values
in the reference register is then performed. The
reference value which is closest to the quality of the
produced speech is determined. The equipment then
presents the quality of the speech. After that it is
decided whether further evaluations are to be made or
not. If no further evaluations are to be performed the
procedure is finished; otherwise the same procedure as
described above is applied.
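
The evaluation branch of the flow chart can be sketched as a nearest-value lookup in the reference register. The register contents below are placeholders and `closest_reference_grade` is a name chosen for this sketch, not one used in the specification.

```python
def closest_reference_grade(mean_delay, reference_register):
    """Return the grade whose stored reference average is closest to `mean_delay`."""
    if not reference_register:
        raise ValueError("the reference register is empty; calibrate first")
    return min(
        reference_register,
        key=lambda grade: abs(reference_register[grade] - mean_delay),
    )


# Hypothetical register filled during calibration (grade -> average delay in s).
reference_register = {"very good": 0.25, "good": 0.5, "poor": 1.0}
result = closest_reference_grade(0.6, reference_register)
```

Raising an error on an empty register mirrors the order of the flow chart, in which calibration precedes evaluation.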

If a person is asked to listen to a text being read
and is given the task of repeating the text, it turns
out that the time difference between the speech repeated
by the subject of the experiment and the speech that is
read to him/her is not very big. Sometimes the subject
of the experiment is even ahead, owing to the redundancy
in the sentences, which makes it possible to predict the
incoming speech. The chance of predicting the
continuation of the incoming speech obviously depends on
how much information has been received from the start of
the speech up to the point of time in question. The
signal parameters of the acoustic signal interact in a
way that is unique to the production apparatus and the
human brain, with the result that the information is
multidimensionally coded. Even non-primary signal
parameters are important for supporting the
interpretation of a statement. The prosody (intonation)
of the speech to the highest degree signals the
syntactic structure and interpretation of a statement.
Synthetic speech is to a large extent lacking the
non-primary signal parameters, which causes the
interacting parameters in many cases to give directly
contradictory information, with the result that
comprehensibility is lower than in natural speech.
Especially in noisy surroundings the listener needs
these non-primary signal parameters, so
comprehensibility is drastically lower in such
surroundings.

By studying the time delay between the speech
repeated by the subject of the experiment and the speech
that is read to him/her, for naturally produced speech
and for synthetic speech, one can classify the speech
quality of the synthetic speech. Because the time delay
varies over time, automatic speech analysis is used to
determine the points of time of the starts of the vowel
segments in the speech that is read or produced by the
synthesizer and in the speech produced by the subject of
the experiment. For each vowel in the speech string the
time delay is determined and the average delay
calculated.

The method can also be used for comparing the
quality of the speech of different speakers, and thereby,
for instance, for judging the social handicap of a
person with speech disturbances. Comparisons between
different text-to-speech converting equipment can also
be made directly.

The invention is not confined to the above
description or the patent claims stated below, but can
be subjected to modifications within the scope of the
idea of the invention.



Claims 



1. Method for deciding speech quality, where a speech
is produced and listened to, and the speech listened
to is reproduced, characterized in that the points
of time for the starts of vowel sounds in the
produced and reproduced speech respectively are
determined, and that the time difference between
corresponding starts of vowel sounds in the produced
and reproduced speech respectively is determined,
and that the time difference indicates the quality
of the produced speech.

2. Method according to claim 1,
characterized in that the reproduction of the
speech is made by a person listening to the speech
and verbally reproducing it.

3. Method according to claim 1,
characterized in that the speech is produced in a
text-to-speech converter, or that a person is reading
a text, or that the speech consists of a message
recorded in advance which is reproduced by, for
instance, a tape recorder.

4. Method according to claim 2,
characterized in that a speech of known quality is
produced, whereby a calibration with regard to who
or what is reproducing the speech is obtained.

5. Method according to claim 1,
characterized in that an average value of the time
difference is formed and that the average indicates
the quality of the speech.

6. Method according to claim 1,
characterized in that calibration is performed by a
speech, the quality of which is defined in advance,
being used for determining the time difference in the
reproduced speech.

7. Method according to claim 1,
characterized in that the comprehensibility of
different sources of sound in relation to different
categories of persons, for instance with impaired
hearing, is definable, whereby a categorization of
different speech producing sources with regard to
comprehensibility is achieved.

8. Device for deciding quality of speech, where a
device (5) is arranged to produce a speech, and a
device (1) is arranged to analyse and reproduce the
speech, characterized in that a device (7) is
arranged to determine starts of vowels in the
produced and reproduced speech, that the device (5)
is arranged to register a time difference between
corresponding starts of vowels in the produced and
reproduced speech, and that the device on the basis
of the time difference is arranged to produce a
measure of the quality of the produced speech.



9. Device according to claim 1,

characterized in that the device (5) consists of a
text-to-speech converter, a device for reproduction of
a recorded speech, or a person.

10. Device according to claim 9,

characterized in that the device (1) is a person who
listens to the produced speech and reproduces it
verbally.



EP 0 727 767 A2



11. Device according to claim 9,

characterized in that the device (7) is arranged to
include a time difference analysis equipment (4)
which registers the time difference between the
stops of the vowel sounds in the produced and
reproduced speech, and is arranged to give a quality
grade of the produced speech.

12. Device according to claim 11,

characterized in that the time difference analysis
equipment (4) is arranged to form an average value
of the obtained time differences and that the average
value indicates the quality of the produced speech.